Real estate is an asset many people want to invest in: it not only delivers financial returns but also serves long-term goals such as wealth preservation. Understanding market trends and property values helps sellers and buyers make optimal, well-informed choices that suit their needs.
However, housing dynamics are shaped by many factors. This study examines four of them: the property’s age, the size of the house, market price trends, and a unique premium feature, the waterfront. Specifically, we test four main questions: (1) the market preference for house size, by examining the median; (2) the market preference for the premium waterfront feature, by examining its proportion in transactions; (3) the impact of housing age on condition, by testing for association; and (4) the price trends of renovated versus newly constructed houses. By addressing these topics, we aim to give stakeholders (sellers, buyers, investors, brokers) a comprehensive picture of USA property market trends at a specific point in time to inform their housing decisions.
We will be using the dataset “US House Price”, which records house transactions in the Seattle Metropolitan Area in the USA from May to July 2014. The variables we test are introduced in each section below. The analysis relies on the following R packages:
- tidyverse: a collection of packages for data manipulation, visualization, and analysis.
- readxl: used for reading Excel files.
- infer: a package for statistical inference.
- DT: a package for creating interactive data tables.
- dplyr: a core tidyverse package for data manipulation.
- ggplot2 and plotly: two packages for visualizing data; we use plotly to make the plots interactive.
- janitor: a package used to clean the data.
library(readxl)
library(dplyr)
library(tidyverse)
library(infer)
library(DT)
library(ggplot2)
library(plotly)
library(janitor)
We import the data file:
master_housing <- read_csv("USA Housing Dataset (1).csv")
There has been debate over whether prospective buyers should choose large houses or smaller homes. While a spacious house offers more room for welcoming guests, hosting bonding activities, or meeting additional needs, affordability, poor connectivity, and insufficient resources are drawbacks that can hold house hunters back from a purchase decision (Shalom, 2018). As a result, budget, preferences, and needs critically determine housing choices (SuperAdmin, 2022).
These considerations are relevant to the Seattle metropolitan area, where house sizes have reportedly been shrinking for years. A significant contributing factor is the rise of townhouse construction: as of 2000, 40% of homes in the region were townhouses, and ongoing shifts in development patterns have reduced the average lot size by 30% over the past two decades (Gatea, 2021). This has pushed prices even higher in a market that was already expensive owing to a strong economy built on technology, healthcare, and maritime industries, prestigious education, a supply-demand imbalance, and other factors (Fox et al., 2024).
Given the continuous decline in house sizes, it is reasonable to hypothesize that the median house size in the Seattle Metropolitan Area differs from that of its state, Washington. The Washington state median of 2,185 sq. ft in 2022 (NeoMam Studios, 2022) serves as a valuable point of comparison. Since Seattle is the state’s largest metropolitan hub, its housing patterns could reflect broader state trends; however, given the city’s unique housing pressures, such as higher land prices, increased demand for urban living, and the rise of compact townhouses, Seattle’s house sizes may deviate from the state median. Because the area’s growing economy and service sector tend to push prices up, we expect house sizes to be smaller than the state median. By testing whether the median house size in the Seattle Metropolitan Area aligns with the state median, we can assess whether Seattle’s housing market follows statewide norms or exhibits distinct characteristics. This distinction offers valuable insight into urban development, housing affordability, and the spatial efficiency of homes in metropolitan areas compared with more suburban or rural parts of the state.
null_hypothesis_1 <- master_housing %>%
specify(response = sqft_living) %>%
hypothesize(null = "point", med = 2185)
We selected the sample median over the mean to limit the influence of outliers, which matters for datasets containing extreme values. Because the median is far less sensitive to outliers, it gives a more faithful representation of the central tendency of such data.
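As a quick illustration (toy numbers, not drawn from the dataset), a single extreme value shifts the mean substantially while barely moving the median:

```r
# Hypothetical living areas (sq. ft): the median resists a single extreme value
sizes <- c(1500, 1700, 1900, 2100, 2300)
mean(sizes)    # 1900
median(sizes)  # 1900

sizes_with_outlier <- c(sizes, 12000)  # add one extreme mansion
mean(sizes_with_outlier)               # jumps to about 3583
median(sizes_with_outlier)             # moves only to 2000
```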
To assess statistical significance, we chose the bootstrap. Since the method does not require the data to follow a known, specific distribution, we can generate an empirical distribution of the median and assess the significance of the observed sample median.
Using bootstrapping, we resampled the dataset 1,000 times with replacement, generating 1,000 median values. Then, treating 2,185 sq. ft as the hypothesized median, we shifted the bootstrap distribution so that 2,185 square feet became its center. Because the spread of the distribution is preserved by the shift, we can test whether the observed value is likely to occur under this null scenario.
null_distribution_1 <- null_hypothesis_1 %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "median")
null_distribution_1
## Response: sqft_living (numeric)
## Null Hypothesis: point
## # A tibble: 1,000 × 2
## replicate stat
## <int> <dbl>
## 1 1 2195
## 2 2 2195
## 3 3 2175
## 4 4 2189
## 5 5 2185
## 6 6 2187
## 7 7 2175
## 8 8 2165
## 9 9 2215
## 10 10 2185
## # ℹ 990 more rows
# Create a ggplot histogram
null_graph_1 <- ggplot(null_distribution_1, aes(x = stat)) +
geom_histogram(binwidth = 10, fill = "#81bfda", color = "black", alpha = 0.7) +
labs(title = "Bootstrap Distribution of Median Square Footage",
x = "Median Square Footage (sqft)",
y = "Frequency") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 18, color = "dark blue", face = "bold", family = "serif"))
# Convert ggplot to plotly for interactivity
ggplotly(null_graph_1)
Figure 3.1.3.2. The bootstrap distribution of median square footage.
We computed the observed median house size in the area from the column sqft_living (the living area in square feet). The observed median was 1,980 sq. ft.
observed_hypothesis_1 <- master_housing %>%
specify(response = sqft_living) %>%
calculate(stat = "median")
observed_hypothesis_1
## Response: sqft_living (numeric)
## # A tibble: 1 × 1
## stat
## <dbl>
## 1 1980
To derive the p-value, we compared the observed statistic to the null distribution. The p-value is the probability, under the null distribution, of a value as extreme as or more extreme than the observed statistic; for this left-tailed test: \(p\text{-value} = \frac{N(\text{stat} \leq 1980)}{N}\) (here \(p = 0\)).
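The same computation can be written by hand; the sketch below uses made-up null statistics (not the real null distribution) to show the left-tailed proportion:

```r
# Left-tailed p-value by hand: proportion of null statistics at or below the observed value
null_stats <- c(2160, 2175, 2180, 2185, 2190, 2195, 2200, 2205, 2210, 2215)  # toy values
observed   <- 1980
mean(null_stats <= observed)  # 0: no null statistic is as low as the observed median
```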
# Calculate p-value
p_value_1 <- null_distribution_1 %>%
get_p_value(obs_stat = observed_hypothesis_1, direction = "less")
# Show the p value in 4 decimals
p_value_1 <- round(p_value_1$p_value, 4)
p_value_1
## [1] 0
Then we overlay the observed statistic on the null distribution:
# Visualize the null distribution with shaded p-value and additional customization
ggplotly(null_distribution_1 %>%
visualize() +
shade_p_value(obs_stat = observed_hypothesis_1, direction = "less") +
geom_vline(aes(xintercept = observed_hypothesis_1$stat), color = "darkred", linetype = "dashed", size = 1) +
labs(title = "Null Distribution of Median Square Footage with Observed Statistic",
x = "Median Square Footage (sqft)",
y = "Frequency") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold", family = "Times", color = "darkred"),
axis.title = element_text(size = 12),
axis.text = element_text(size = 10)) +
scale_fill_manual(values = c("blue")))
Figure 3.1.3.4. Null Distribution of Median Square Footage with Observed Statistic.
Having derived \(p_{value} < 0.05\), we rejected the null hypothesis that the median house size in the Seattle Metropolitan Area equals 2,185 square feet. Under the null hypothesis, the probability of observing a median as low as the observed 1,980 square feet is extremely small; therefore, it is unlikely that the null hypothesis is true.
# Set the significance level
alpha <- 0.05
# Test conclusion
if (p_value_1 < alpha) {
conclusion <- "Reject the null hypothesis: The median house size of municipal cities in the Seattle Metropolitan Area is significantly less than 2,185 square feet."
} else {
conclusion <- "Fail to reject the null hypothesis: There is not enough evidence to conclude that the median house size of municipal cities in the Seattle Metropolitan Area is significantly less than 2,185 square feet."
}
# Display the conclusion
conclusion
## [1] "Reject the null hypothesis: The median house size of municipal cities in the Seattle Metropolitan Area is significantly less than 2,185 square feet."
To consolidate the conclusion and determine the range of plausible values for the median size, we computed a confidence interval. Using the same bootstrapping method as in Step 3 (without shifting the distribution), we obtained the actual bootstrap distribution. From it, we took the \(95\%\) confidence interval between the \(2.5\%\) and \(97.5\%\) quantiles (excluding the \(5\%\) most extreme values). The \(95\%\) confidence interval for the median house size ranged from 1,950 to 2,010 square feet, entirely below 2,185. Therefore, we have statistical evidence supporting the alternative hypothesis that the median house size in the Seattle Metropolitan Area was less than 2,185 square feet.
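The percentile idea behind get_ci() can be sketched on simulated data; everything below (the toy sizes, seed, and replicate count) is illustrative, not taken from the dataset:

```r
# Percentile-method sketch: bootstrap medians of a toy sample, then take the
# 2.5% and 97.5% quantiles as the 95% interval
set.seed(42)
toy_sizes <- c(1800, 1900, 1980, 2050, 2200)
boot_medians <- replicate(1000,
                          median(sample(toy_sizes, size = length(toy_sizes), replace = TRUE)))
quantile(boot_medians, probs = c(0.025, 0.975))  # 95% percentile interval
```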
# Calculate 95% confidence interval
boot_distn_one_median <- master_housing %>%
specify(response = sqft_living) %>%
generate(reps = 10000, type = "bootstrap") %>%
calculate(stat = "median")
ci <- boot_distn_one_median %>%
get_ci()
ci
## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 1950 2010
# Visualize the bootstrap distribution with confidence interval and make it interactive
ggplot(boot_distn_one_median, aes(x = stat)) +
geom_histogram(binwidth = 10, fill = "#81bfda", color = "black", alpha = 0.7) +
geom_vline(aes(xintercept = ci$lower_ci), color = "#E38E49", linetype = "solid", size = 2) +
geom_vline(aes(xintercept = ci$upper_ci), color = "#E38E49", linetype = "solid", size = 2) +
geom_text(aes(x = ci$lower_ci, y = Inf, label = "Lower CI"), color = "#E38E49", vjust = -0.5, hjust = 1.1) +
geom_text(aes(x = ci$upper_ci, y = Inf, label = "Upper CI"), color = "#E38E49", vjust = -0.5, hjust = -0.1) +
scale_fill_brewer(palette = "Set3") +
labs(title = "Bootstrap Distribution of Median Square Footage\nwith Confidence Interval",
x = "Median Square Footage",
y = "Frequency", size = 14) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold", family = "Times", color = "darkblue"),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12),
panel.grid.major = element_line(color = "gray80"),
panel.grid.minor = element_line(color = "gray90")) +
annotate("rect", xmin = ci$lower_ci, xmax = ci$upper_ci, ymin = -Inf, ymax = Inf, alpha = 0.3, fill = "#E38E49")
Figure 3.1.3.6. Bootstrap Distribution of Median Square Footage with Confidence Interval.
Based on our analysis, we have strong statistical evidence that the median house size in cities within the Seattle Metropolitan Area is smaller than 2,185 square feet. The conclusion is supported by two pieces of evidence: the hypothesis test (p-value below 0.05) and the bootstrap confidence interval (entirely below 2,185 square feet).
Insights on house size trends in the Seattle Metropolitan Area include, but are not limited to, the following:
1. Deviation from the general state assumption:
The analysis reveals a smaller median house size in the Seattle Metropolitan Area than in Washington state overall. This finding challenges the assumption that local median sizes align with state-level figures.
2. Possible explanations for smaller house sizes in the area (besides increasing townhouse model):
Population growth and workforce demand: The workforce has been growing thanks to the presence of attractive high-tech companies. This population increase drives the need for smaller, more affordable options to accommodate the influx of residents.
Dynamics in living arrangements: The rise of new housing types, such as apartments, multi-family units, and multi-generational living, reduces the typical house size in this region.
Having highlighted customers’ housing behavior and trends, businesses and policymakers should pay more attention to future housing development, payment planning, and market strategies.
1. Data relevance
The benchmark for comparison was extracted from 2022 data, whereas our tested dataset covers Quarter 2, 2014. This time gap may weaken the relevance of the comparison, since housing trends can change over time.
2. Data Representation
The sample size we used was small, potentially introducing bias and limiting the generalizability of the results. A larger sample would be required to reflect housing trends reliably.
3. The use of the median as a measure
Because the median is insensitive to the tails of the distribution, the test cannot capture the full distributional characteristics of house size. We therefore could not analyze the dataset’s other statistical features comprehensively.
Buyers should adjust their size expectations, since houses in the area are generally smaller than typical houses in Washington state. If options are too few, they should diversify across house types and neighborhoods, or consider alternative areas. A more flexible approach can lead to better decisions than holding out for unmatched options in the Seattle Metropolitan Area.
Sellers should strategize pricing and positioning more carefully. Suggestions include leveraging larger house sizes for key target audiences (where small houses dominate, larger houses stand out) or maximizing small houses’ functionality (giving them a unique selling point against similar houses).
For example, sellers can list large properties as a premium offer for those preferring spacious living, or highlight a small house’s key features to strengthen its market competitiveness.
Waterfront properties are widely regarded as premium real estate because of their higher market values and unique lifestyle benefits, such as scenic views, privacy, and rental income potential. However, owing to their high value, unique locations, and limited availability, such houses are persistently in high demand while supply remains limited (Amres, 2024).
To understand the availability of such houses, we want to know what proportion of all transactions involved waterfront properties. One article states that “This (waterfront houses) is a rare kind of home. In a given year, about 0.4 percent to 0.6 percent of all property transactions are for houses on the water.” (Forbe, 2018). To check this claim against our data, we perform a hypothesis test of the claim that “The proportion of waterfront houses is \(0.6\%\) of all house transactions” on the dataset of USA house transactions from May to July 2014.
# Convert the waterfront column to a factor with levels "0" and "1"
master_housing$waterfront <- factor(master_housing$waterfront, levels = c("0", "1"))
In the dataset “USA House Price”, the variable waterfront is described as follows:
“A binary indicator showing whether the property has a waterfront view (1 for yes, 0 for no). Waterfront properties often enjoy higher valuations due to their desirability.”
# Define the null hypothesis proportion
null_hypothesis_2 <- 0.006
# Generate the null distribution
null_distribution_2 <- master_housing %>%
specify(response = waterfront, success = "1") %>%
hypothesize(null = "point", p = null_hypothesis_2) %>%
generate(reps = 1000, type = "draw") %>%
calculate(stat = "prop")
head(null_distribution_2)
## Response: waterfront (factor)
## Null Hypothesis: point
## # A tibble: 6 × 2
## replicate stat
## <int> <dbl>
## 1 1 0.00556
## 2 2 0.00362
## 3 3 0.00652
## 4 4 0.00531
## 5 5 0.00797
## 6 6 0.00604
ggplotly(ggplot(null_distribution_2, aes(x = stat)) +
geom_histogram(binwidth = 0.001, fill = "#81bfda", color = "black", alpha = 0.7) +
labs(title = "Histogram of the Null Distribution",
x = "Proportion of Waterfront Houses",
y = "Frequency") +
theme_minimal() + theme(plot.title = element_text(hjust = 0.5, size = 18, color = "dark blue", face = "bold", family = "serif")))
Figure 3.2.3.2. Histogram of the Null Distribution (given the null hypothesis of hypothesis 2 is true).
The observed statistic is computed from the sample: the proportion of waterfront houses among all houses. We obtained approximately 0.0075, meaning that about \(0.75\%\) of house transactions in the dataset involve waterfront houses.
observed_statistic_2 <- master_housing %>%
specify(response = waterfront, success = "1") %>%
calculate(stat = "prop")
observed_statistic_2
## Response: waterfront (factor)
## # A tibble: 1 × 1
## stat
## <dbl>
## 1 0.00749
p_value <- null_distribution_2 %>%
get_p_value(obs_stat = observed_statistic_2$stat, direction = "greater")
p_value
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0.138
Then we overlay the observed statistic on the null distribution:
null_distribution_2 %>%
visualize() +
shade_p_value(obs_stat = observed_statistic_2$stat, color = "darkred", direction = "greater") + theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold", family = "Times", color = "darkred"),
axis.title = element_text(size = 12),
axis.text = element_text(size = 10)) +
scale_fill_manual(values = c("blue"))
Figure 3.2.3.4. Null Distribution of Proportion of Houses having Waterfront view with Observed Statistic.
# Set the significance level
alpha <- 0.05
# Test conclusion
if (p_value$p_value < alpha) {
conclusion <- "Reject the null hypothesis: The proportion of waterfront homes in the sample is significantly higher than the hypothesized 0.6%"
} else {
conclusion <- "Fail to reject the null hypothesis: There is not enough evidence to conclude that the proportion of waterfront homes in the sample is significantly different from the hypothesized 0.6%."
}
# Display the conclusion
conclusion
## [1] "Fail to reject the null hypothesis: There is not enough evidence to conclude that the proportion of waterfront homes in the sample is significantly different from the hypothesized 0.6%."
For this step, we generated a bootstrap distribution for the sample proportion of waterfront houses by resampling the data 10,000 times.
The resulting \(95\%\) confidence interval is \([0.0051, 0.0101]\), meaning the true proportion of waterfront houses is likely to fall within this range.
Since the hypothesized proportion \(p = 0.006\) falls inside the interval, it remains a plausible value for the true proportion. Consistent with the hypothesis test, we do not have enough evidence to reject the null hypothesis: the observed data do not differ significantly from the hypothesized value.
# Generate bootstrap distribution for one proportion
boot_distn_one_prop <- master_housing %>%
specify(response = waterfront, success = "1") %>%
generate(reps = 10000, type = "bootstrap") %>%
calculate(stat = "prop")
# Calculate the confidence interval
ci_2 <- boot_distn_one_prop %>%
get_ci()
# Display the confidence interval
ci_2
## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 0.00507 0.0101
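As a quick containment check, hard-coding the bounds printed above (a sketch; in the live document one would read them from ci_2):

```r
# Does the hypothesized proportion lie inside the bootstrap confidence interval?
p_null   <- 0.006
ci_lower <- 0.00507  # bounds as printed above
ci_upper <- 0.0101
p_null >= ci_lower && p_null <= ci_upper  # TRUE: 0.006 is a plausible value
```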
ggplot(boot_distn_one_prop, aes(x = stat)) +
geom_histogram(binwidth = 0.001, fill = "#81bfda", color = "black", alpha = 0.7) +
geom_vline(aes(xintercept = ci_2$lower_ci), color = "#E38E49", linetype = "solid", size = 2) +
geom_vline(aes(xintercept = ci_2$upper_ci), color = "#E38E49", linetype = "solid", size = 2) +
geom_text(aes(x = ci_2$lower_ci, y = Inf, label = "Lower CI"), color = "#E38E49", vjust = -0.5, hjust = 1.1) +
geom_text(aes(x = ci_2$upper_ci, y = Inf, label = "Upper CI"), color = "#E38E49", vjust = -0.5, hjust = -0.1) +
scale_fill_brewer(palette = "Set3") +
labs(title = "Simulation-based Bootstrap Distribution",
x = "proportion",
y = "count", size = 14) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold", family = "Times", color = "darkblue"),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12),
panel.grid.major = element_line(color = "gray80"),
panel.grid.minor = element_line(color = "gray90")) +
annotate("rect", xmin = ci_2$lower_ci, xmax = ci_2$upper_ci, ymin = -Inf, ymax = Inf, alpha = 0.3, fill = "#E38E49")
Figure 3.2.3.6. Bootstrap Distribution of Proportion of Houses having Waterfront view with Confidence Interval.
Lack of reference sources: The null value of 0.6% may not be the most appropriate for this context, since our dataset covers the Seattle metropolitan area in Washington state while the 0.6% figure refers to the USA as a whole. A different null hypothesis could be more suitable, depending on available information and expert knowledge.
Lack of data representation: The data cover only cities within the Seattle metropolitan area, which may not represent the entire population of transactions in terms of geographic and socioeconomic factors.
Waterfront houses can be extremely expensive, so buyers should be realistic about their budget and account for the high costs associated with waterfront properties. Buying a waterfront house can bring unique experiences, but limited availability, low supply, and environmental issues can be significant shortcomings.
Housing age often influences maintenance requirements, safety, and market value, making it a crucial factor for investors to consider when allocating funds to real estate (FasterCapital, n.d.). Some believe that older houses are in worse condition than newer ones. We use the given data to test whether a house’s condition depends on its age group. By testing the relationship between houses’ age groups (three groups) and their conditions, we can determine whether a house’s age is a significant factor in predicting the condition of a property, or whether other variables, such as location or socioeconomic factors, play a more prominent role. The findings can guide investment strategies for both local and international investors, including those from Vietnam, looking to navigate the US housing market effectively.
# Add age_group column to master_housing based on the age of houses
master_housing <- master_housing %>%
mutate(age = 2014 - yr_built,
age_group = case_when(
age <= 33 ~ "0-33",
age <= 68 ~ "34-68",
TRUE ~ "69-114"
))
# Display the updated master_housing dataset
master_housing
## # A tibble: 4,140 × 20
## date price bedrooms bathrooms sqft_living sqft_lot floors
## <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2014-05-09 00:00:00 376000 3 2 1340 1384 3
## 2 2014-05-09 00:00:00 800000 4 3.25 3540 159430 2
## 3 2014-05-09 00:00:00 2238888 5 6.5 7270 130017 2
## 4 2014-05-09 00:00:00 324000 3 2.25 998 904 2
## 5 2014-05-10 00:00:00 549900 5 2.75 3060 7015 1
## 6 2014-05-10 00:00:00 320000 3 2.5 2130 6969 2
## 7 2014-05-10 00:00:00 875000 4 2 2520 6000 1
## 8 2014-05-10 00:00:00 265000 4 1 1940 9533 1
## 9 2014-05-10 00:00:00 394950 3 2.5 1350 1250 3
## 10 2014-05-11 00:00:00 842500 4 2.5 2160 5298 2.5
## # ℹ 4,130 more rows
## # ℹ 13 more variables: waterfront <fct>, view <dbl>, condition <dbl>,
## # sqft_above <dbl>, sqft_basement <dbl>, yr_built <dbl>, yr_renovated <dbl>,
## # street <chr>, city <chr>, statezip <chr>, country <chr>, age <dbl>,
## # age_group <chr>
The Chi-square test statistic was chosen because it measures how much the observed frequencies for condition and house age group differ from the expected frequencies under the null hypothesis.
This statistic evaluates the association between condition and age_group in the actual dataset and serves as the reference point for comparison with the null distribution in a hypothesis test. Using the Chi-square test of independence, we test whether housing age influences the condition of homes in the USA. For instance, older homes might fall into the “level 1” condition category more often due to outdated materials and aging infrastructure, while newer homes might dominate the “level 5” category because of modern construction standards and materials. The statistic follows the equation below:
\[T = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\]
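To make the formula concrete, here is a toy 2×2 example (made-up counts, not from the dataset) that builds expected counts under independence and then computes the statistic:

```r
# Toy chi-square computation: observed counts, expected counts under independence,
# then T = sum((O - E)^2 / E)
O <- matrix(c(30, 20, 10, 40), nrow = 2)   # observed 2x2 table
E <- rowSums(O) %o% colSums(O) / sum(O)    # expected counts under independence
sum((O - E)^2 / E)                         # about 16.67
```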
# Load necessary library
# Create a table of house conditions and age groups
condition_age_table <- master_housing %>%
tabyl(condition, age_group) %>%
adorn_totals(where = c("row", "col"))
# Display the table
condition_age_table
## condition 0-33 34-68 69-114 Total
## 1 0 1 4 5
## 2 1 15 11 27
## 3 1587 665 344 2596
## 4 184 652 278 1114
## 5 21 187 190 398
## Total 1793 1520 827 4140
# Specify the null hypothesis
null_hypothesis3 <- master_housing %>%
specify(age_group ~ condition) %>%
hypothesize(null = "independence")
The null distribution (Figure 3.3.3.2) is right-skewed: most test statistics cluster near the lower end, with fewer observations as the values increase.
one_null_sample3 <- null_hypothesis3 %>%
generate(reps = 1, type = "permute")
(O_t1=table(one_null_sample3$age_group, one_null_sample3$condition))
##
## 1 2 3 4 5
## 0-33 2 12 1087 517 175
## 34-68 2 9 970 393 146
## 69-114 1 6 539 204 77
# Chi-square statistic for this permuted sample; E_t, the expected-count table, is computed below
sum((O_t1-E_t)^2/E_t)
## [1] 7.289755
null_hypothesis3$condition <- as.factor(null_hypothesis3$condition)
null_hypothesis3$age_group <- as.factor(null_hypothesis3$age_group)
null_distribution3 <- null_hypothesis3 %>%
specify(response = age_group, explanatory = condition) %>%
hypothesize(null = "independence") %>%
generate(reps = 10000, type = "permute") %>%
calculate(stat = "Chisq")
ggplotly(ggplot(null_distribution3, aes(x = stat)) +
geom_histogram(binwidth = 2, fill = "#81bfda", color = "black", alpha = 0.7) +
labs(title = "Histogram of the Null Distribution",
x = "Chi-square test statistic",
y = "Frequency") +
theme_minimal() + theme(plot.title = element_text(hjust = 0.5, size = 18, color = "dark blue", face = "bold", family = "serif")))
Figure 3.3.3.2. Histogram of the Null Distribution of Chi Square Test Statistic
(O_t=table(master_housing$age_group, master_housing$condition))
##
## 1 2 3 4 5
## 0-33 0 1 1587 184 21
## 34-68 1 15 665 652 187
## 69-114 4 11 344 278 190
expectedIndependent = function(X) {
n = sum(X)
p = rowSums(X)/sum(X)
q = colSums(X)/sum(X)
return(p %o% q * n) # outer product creates table
}
(E_t=expectedIndependent(table(master_housing$age_group,master_housing$condition)))
## 1 2 3 4 5
## 0-33 2.1654589 11.693478 1124.3063 482.4643 172.37053
## 34-68 1.8357488 9.913043 953.1208 409.0048 146.12560
## 69-114 0.9987923 5.393478 518.5729 222.5309 79.50386
observed_stat3 <- sum((O_t-E_t)^2/E_t)
(p_value3 <- null_distribution3 %>%
get_p_value(obs_stat = observed_stat3, direction = "greater"))
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0
# Calculate the observed chi-squared statistic
observed_stat3 <- data.frame(stat = sum((O_t - E_t)^2 / E_t))
graph3 <- ggplot(null_distribution3, aes(x = stat)) +
geom_histogram(binwidth = 5, col="blue", fill = "blue", alpha = 0.7, boundary = 0) +
geom_vline(aes(xintercept = observed_stat3$stat), color = "darkred", linetype = "dashed", size = 1) +
labs(title = "Histogram of Null Distribution with Observed Statistic",
x = "Test statistics",
y = "Frequency") +
theme(plot.title = element_text(hjust = 0.5, size = 18, color = "dark blue", face = "bold", family = "serif")) +
theme_minimal()
ggplotly(graph3)
Figure 3.3.3.4. Histogram of Null Distribution with Observed Statistic.
Since the \(p_{value} = 0 < \alpha = 0.05\), the null hypothesis is rejected.
The rejection of the null hypothesis implies that there is statistical evidence to support the alternative hypothesis that the condition of a house is dependent on its age group.
In other words, the condition ratings (1 to 5) vary across different house age groups, suggesting that older or newer houses might have distinct condition profiles. This dependency might reflect factors such as maintenance trends, construction standards, or wear over time.
# Set the significance level
alpha <- 0.05
# Test conclusion
if (p_value3$p_value < alpha) {
conclusion3 <- "Reject the null hypothesis: The condition of a house (ratings from 1 to 5) depends on the house's age group."
} else {
conclusion3 <- "Fail to reject the null hypothesis: There is not enough evidence to conclude that the condition of a house depends on its age group."
}
# Display the conclusion
conclusion3
## [1] "Reject the null hypothesis: The condition of a house (ratings from 1 to 5) depends on the house's age group."
The bootstrap is a statistical tool for estimating confidence intervals and assessing the significance of observed results. By resampling the original dataset 1,000 times, we generate a distribution that approximates the variability of the test statistic.
From the bootstrap distribution, we calculated a 95% confidence interval, with lower and upper bounds for the test statistic. Comparing this interval with the null distribution provides a basis for evaluating whether the observed data align with the null hypothesis.
Because the \(95\%\) confidence interval for the test statistic lies entirely above the range of values produced under the null distribution, the observed effect is unlikely to have occurred by random chance under the null hypothesis. This provides strong evidence supporting the alternative hypothesis. Although the interval varies slightly with each bootstrap run, it consistently provides statistically significant evidence to reject the null hypothesis. These results support a relationship between a house’s condition and its age group and highlight the value of bootstrap-based inference for reliable conclusions in data analysis.
master_housing$age_group <- as.factor(master_housing$age_group)
master_housing$condition <- as.factor(master_housing$condition)
boot_distn3 <- master_housing %>%
specify(age_group ~ condition) %>%
generate(reps = 1000, type = "bootstrap") %>%
calculate(stat = "Chisq")
ci_boot3 <-boot_distn3 %>%
get_ci()
ci_boot3
## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 912. 1121.
ggplot(boot_distn3, aes(x = stat)) +
geom_histogram(binwidth = 20, fill = "#81bfda", color = "black", alpha = 0.7) +
geom_vline(aes(xintercept = ci_boot3$lower_ci), color = "#E38E49", linetype = "solid", size = 2) +
geom_vline(aes(xintercept = ci_boot3$upper_ci), color = "#E38E49", linetype = "solid", size = 2) +
geom_text(aes(x = ci_boot3$lower_ci, y = Inf, label = "Lower CI"), color = "#E38E49", vjust = -0.5, hjust = 1.1) +
geom_text(aes(x = ci_boot3$upper_ci, y = Inf, label = "Upper CI"), color = "#E38E49", vjust = -0.5, hjust = -0.1) +
scale_fill_brewer(palette = "Set3") +
labs(title = "Simulation-based Bootstrap Distribution",
x = "Test Statistics",
y = "count", size = 14) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold", family = "Times", color = "darkblue"),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12),
panel.grid.major = element_line(color = "gray80"),
panel.grid.minor = element_line(color = "gray90")) +
annotate("rect", xmin = ci_boot3$lower_ci, xmax = ci_boot3$upper_ci, ymin = -Inf, ymax = Inf, alpha = 0.3, fill = "#E38E49")
Figure 3.3.3.5.: Simulation-based Bootstrap Distribution (Hypothesis 3).
Association, not causation: Although the test result shows a relationship between the age group of houses and their condition, this does not imply a causal connection. Our test considers only the chronological age of the house, not the effective age, which reflects the actual condition of the property after renovations and maintenance. With nearly 41% of houses in the dataset having been renovated, chronological age alone may not accurately represent the true state of these properties.
In addition, other factors, such as renovation quality or maintenance frequency, may also influence the results, and these are not accounted for in the test.
For buyers and investors: Buyers and investors should consider both the chronological age and the effective age of a property. While chronological age indicates how long the house has existed, effective age reflects the true remaining life of a property, accounting for the typical life expectancy of such a building as well as its use (CoreLogic, n.d.). A well-renovated older house may be in better condition than a newer one that hasn’t been maintained properly. It is important to always inspect the quality of renovations, assess maintenance records, and evaluate how these factors affect the house’s long-term value and potential costs.
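The distinction between chronological and effective age can be sketched in code. A minimal, hypothetical illustration — the toy values are made up, and only the column names `yr_built` and `yr_renovated` come from the dataset — counts a renovated house’s age from its renovation year:

```r
library(dplyr)

# Toy data: 0 in yr_renovated means the house was never renovated
toy <- tibble(
  yr_built     = c(1950, 1950, 2000),
  yr_renovated = c(2005,    0,    0)
)

reference_year <- 2014  # year of the transactions in the dataset

toy <- toy %>%
  mutate(
    chronological_age = reference_year - yr_built,
    # Effective-age proxy: count from the renovation year when one exists
    effective_age = reference_year -
      if_else(yr_renovated > 0, yr_renovated, yr_built)
  )
toy
```

Under this proxy, the 1950 house renovated in 2005 has a chronological age of 64 years but an effective age of only 9, illustrating why the two measures can diverge sharply.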
Renovations are often believed to significantly boost a home’s market value, with improvements like modernized kitchens, upgraded bathrooms, and added square footage linked to price premiums. Supporting this, a Ph.D. thesis from the University of Padova found that structural and energy-efficient renovations enhance property value by improving physical attributes and urban appeal (Xu, 2022). Similarly, researchers at the University of Cambridge reported that homes with high Energy Performance Certificate (EPC) ratings sold for up to 14% more, highlighting the market value of energy-efficient upgrades (Fuerst et al., 2013).
One frequent inquiry is whether renovated homes really sell for more than non-renovated ones. This study aims to investigate this question for the Seattle Metropolitan Area by testing the null hypothesis that there is no significant difference in the average prices of renovated and non-renovated homes.
The motivation for this analysis is to provide house sellers and buyers with evidence-based insights on the financial impact of renovations, helping them make informed decisions by analyzing transactions of houses built between 1900 and 2014 in the Seattle Metropolitan Area.
\(\mu_R\): Mean price of renovated houses.
\(\mu_N\): Mean price of non-renovated houses.
This step extracts key columns (price (house price), yr_renovated (the year when the house was renovated), yr_built (the year when the house was originally built), and sqft_living (living area in square feet)) to create a new dataset. The data is then grouped by home size using intervals of 300 square feet. The most popular size group is selected for hypothesis testing. This ensures that comparisons are made between houses of similar size, minimizing confounding effects and enhancing the validity of the results.
# Load the necessary library
library(dplyr)
# Divide the price by 1,000 to express it in thousands of USD (KUSD)
master_housing <- master_housing %>%
mutate(price = price / 1000)
# Round the value in price to 1 decimal place
master_housing <- master_housing %>%
mutate(price = round(price, 1))
# Display the updated data frame
master_housing
# Extract the specified columns from the master_housing_city dataframe
data4 <- master_housing %>%
select(price, yr_renovated, yr_built, sqft_living)
data4
# Create a new column 'Sqft_group' with integer labels for intervals of 300 square feet
data4 <- data4 %>%
mutate(Sqft_group = cut(sqft_living,
breaks = seq(0, max(sqft_living, na.rm = TRUE) + 300, by = 300),
include.lowest = TRUE,
labels = seq(300, max(sqft_living, na.rm = TRUE) + 300, by = 300)))
head(data4,5)
# Count the rows of each value in column Sqft_group
sqft_group_counts <- data4 %>%
group_by(Sqft_group) %>%
summarise(count = n())
# Find the largest count
max_count <- max(sqft_group_counts$count)
# Display the results, just top rows
head(sqft_group_counts, 10)
max_count
# Create a dataset named "data5" from data4, keeping only the rows in the
# 1,800-2,100 sqft group (Sqft_group == 2100)
data5 <- data4 %>%
filter(Sqft_group == 2100)
# Display data5
data5
# Create a new table with three columns: yr_built, price, and status (renovated or newly built)
price_table <- data5 %>%
mutate(status = ifelse(yr_renovated > 0, "renovated", "non-renovated")) %>%
select(yr_built, price, status) %>%
filter(!is.na(price) & price > 0)
head(price_table)
# Load the necessary library
library(infer)
# Specify the null hypothesis that the price of non-renovated houses and renovated houses is not different
null_hypothesis_4 <- price_table %>%
specify(price ~ status) %>%
hypothesize(null = "independence")
# Generate the null distribution
one_null_sample <- null_hypothesis_4 %>%
generate(reps = 1, type = "permute")
one_null_sample
## Response: price (numeric)
## Explanatory: status (factor)
## Null Hypothesis: independence
## # A tibble: 591 × 3
## # Groups: replicate [1]
## price status replicate
## <dbl> <fct> <int>
## 1 325. renovated 1
## 2 266. non-renovated 1
## 3 290 renovated 1
## 4 392 renovated 1
## 5 249 non-renovated 1
## 6 505 renovated 1
## 7 400 renovated 1
## 8 446. non-renovated 1
## 9 520. non-renovated 1
## 10 288. renovated 1
## # ℹ 581 more rows
one_null_sample %>%
group_by(status) %>%
summarize(mean_price = mean(price))
## # A tibble: 2 × 2
## status mean_price
## <fct> <dbl>
## 1 non-renovated 449.
## 2 renovated 476.
# Calculate the difference in means for this single permuted sample
one_null_sample %>%
calculate(stat = "diff in means", order = c("renovated", "non-renovated"))
## Response: price (numeric)
## Explanatory: status (factor)
## Null Hypothesis: independence
## # A tibble: 1 × 1
## stat
## <dbl>
## 1 26.1
With 10,000 repetitions, the “generate” function creates a distribution of differences in means assuming the null hypothesis is true (i.e., there is no difference between the two groups).
The ggplot function plots the null distribution as a histogram. Each bar represents the frequency of a specific difference in means that could be observed under the null hypothesis.
# Generate the null distribution
null_distribution_4 <- null_hypothesis_4 %>%
generate(reps = 10000, type = "permute") %>%
calculate(stat = "diff in means", order = c("renovated", "non-renovated"))
null_distribution_4
## Response: price (numeric)
## Explanatory: status (factor)
## Null Hypothesis: independence
## # A tibble: 10,000 × 2
## replicate stat
## <int> <dbl>
## 1 1 -20.9
## 2 2 5.54
## 3 3 -43.7
## 4 4 15.2
## 5 5 7.06
## 6 6 3.24
## 7 7 -13.5
## 8 8 -16.5
## 9 9 -7.41
## 10 10 -19.6
## # ℹ 9,990 more rows
library(plotly)
ggplotly(ggplot(null_distribution_4, aes(x = stat)) +
geom_histogram(binwidth = 10, fill = "#81bfda", color = "black", alpha = 0.7) +
labs(title = "Histogram of the Null Distribution",
x = "Difference in Means: Renovated - Non-renovated House Price (KUSD)",
y = "Frequency") +
theme_minimal() + theme(plot.title = element_text(hjust = 0.5, size = 18, color = "dark blue", face = "bold", family = "serif")))
Figure 3.4.3.2. Null distribution of Difference in means (Renovated house price - Non-renovated house price in KUSD)
This step is to calculate the observed difference in means between renovated and non-renovated homes based on the actual data. This observed statistic is then used to compare against the null distribution to assess if the difference is statistically significant.
# Calculate the observed statistic
observed_stat_4 <- price_table %>%
specify(price ~ status) %>%
calculate(stat = "diff in means", order = c("renovated", "non-renovated"))
observed_stat_4
## Response: price (numeric)
## Explanatory: status (factor)
## # A tibble: 1 × 1
## stat
## <dbl>
## 1 11.2
The \(p_{value}\) is calculated by comparing the observed statistic (observed_stat_4) to the null distribution (null_distribution_4). The p-value assesses the likelihood of obtaining a result as extreme as the observed statistic under the null hypothesis.
# Calculate the p-value
p_value_4 <- null_distribution_4 %>%
get_p_value(obs_stat = observed_stat_4, direction = "both")
p_value_4
## # A tibble: 1 × 1
## p_value
## <dbl>
## 1 0.506
We graphed the null distribution with the observed statistic below:
ggplotly(null_distribution_4 %>%
visualize() +
shade_p_value(obs_stat = observed_stat_4$stat, color = "darkred", direction = "both") + theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold", family = "Times", color = "darkred"),
axis.title = element_text(size = 12),
axis.text = element_text(size = 10)) +
scale_fill_manual(values = c("blue")) +
labs(title = "Null Distribution of Differences in Price Means with Observed Statistic",
x = "Differences in Price Means (KUSD)",
y = "Frequency"))
Figure 3.4.3.4. Null distribution of Difference in means (Renovated house price - Non-renovated house price in KUSD) with Observed Statistic
\(p_{value} > \alpha = 0.05\)
The p-value obtained is greater than the significance level of 0.05. Therefore, we fail to reject the null hypothesis. This indicates that there is insufficient evidence to suggest a significant difference in the average prices of renovated and non-renovated homes in the Seattle Metropolitan Area. Consequently, the data does not support the claim that renovations lead to a statistically significant increase in home prices.
# Set the significance level
alpha <- 0.05
# Conclude the hypothesis testing result
if (p_value_4 < alpha) {
conclusion <- "Reject the null hypothesis: There is a significant difference in mean prices between renovated and non-renovated houses."
} else {
conclusion <- "Fail to reject the null hypothesis: There is no significant difference in mean prices between renovated and non-renovated houses."
}
conclusion
## [1] "Fail to reject the null hypothesis: There is no significant difference in mean prices between renovated and non-renovated houses."
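As a sanity check alongside the permutation test, a theory-based Welch two-sample t-test could be run on the same data, e.g. `t.test(price ~ status, data = price_table)`. A self-contained toy illustration is sketched below; the simulated prices are made up and are not drawn from the dataset:

```r
set.seed(42)
# Simulated prices (KUSD) for two groups drawn from the same distribution,
# mirroring the null hypothesis of no difference in means
renovated     <- rnorm(50, mean = 460, sd = 120)
non_renovated <- rnorm(50, mean = 460, sd = 120)

# Welch two-sample t-test (does not assume equal variances)
result <- t.test(renovated, non_renovated)
result$p.value
```

When the two groups share the same underlying mean, the t-test p-value will typically exceed 0.05, matching the permutation-based conclusion of failing to reject the null hypothesis.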
This step generates a bootstrap distribution for the test statistic, which is helpful to estimate the variability of the difference in means between renovated and non-renovated homes based on resampling from the observed data.
If the interval includes zero, it means there might be no real difference between the two groups. If zero is not in the range, it suggests there is a significant difference in prices.
# Generate bootstrap distribution for the test statistic
bootstrap_distribution_4 <- price_table %>%
specify(price ~ status) %>%
generate(reps = 10000, type = "bootstrap") %>%
calculate(stat = "diff in means", order = c("renovated", "non-renovated"))
# Calculate the confidence interval and store it in ci_4
ci_4 <- bootstrap_distribution_4 %>%
get_ci()
ci_4
## # A tibble: 1 × 2
## lower_ci upper_ci
## <dbl> <dbl>
## 1 -21.1 45.0
ggplot(bootstrap_distribution_4, aes(x = stat)) +
geom_histogram(binwidth = 10, fill = "#81bfda", color = "black", alpha = 0.7) +
geom_vline(aes(xintercept = ci_4$lower_ci), color = "#E38E49", linetype = "solid", size = 2) +
geom_vline(aes(xintercept = ci_4$upper_ci), color = "#E38E49", linetype = "solid", size = 2) +
geom_text(aes(x = ci_4$lower_ci, y = Inf, label = "Lower CI"), color = "#E38E49", vjust = -0.5, hjust = 1.1) +
geom_text(aes(x = ci_4$upper_ci, y = Inf, label = "Upper CI"), color = "#E38E49", vjust = -0.5, hjust = -0.1) +
scale_fill_brewer(palette = "Set3") +
labs(title = "Simulation-based Bootstrap Distribution",
x = "Difference in Means",
y = "count", size = 14) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold", family = "Times", color = "darkblue"),
axis.title = element_text(size = 14),
axis.text = element_text(size = 12),
panel.grid.major = element_line(color = "gray80"),
panel.grid.minor = element_line(color = "gray90")) +
annotate("rect", xmin = ci_4$lower_ci, xmax = ci_4$upper_ci, ymin = -Inf, ymax = Inf, alpha = 0.3, fill = "#E38E49")
Figure 3.4.3.5.: Simulation-based Bootstrap Distribution (Hypothesis 4).
The confidence interval ranges from -21.1 to 45.0, indicating that the true difference in price between renovated and non-renovated homes could be as low as -21.1 or as high as 45.0 KUSD.
Since this interval includes zero, it suggests there is no clear significant difference in price, supporting the conclusion that renovations may not have a substantial impact on home prices in the Seattle Metropolitan Area.
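The “does the interval contain zero” check can be made explicit in code. The bounds below are hard-coded from the confidence-interval output shown earlier:

```r
# Bootstrap CI bounds (KUSD), taken from the earlier ci_4 output
lower_ci <- -21.1
upper_ci <-  45.0

# TRUE when zero lies inside the interval, i.e. the data are consistent
# with no difference in mean price between the two groups
contains_zero <- lower_ci <= 0 && 0 <= upper_ci
contains_zero  # TRUE
```

The same check could be run directly on `ci_4` in the analysis above.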
The hypothesis test compared the average prices of renovated and non-renovated homes in the Seattle Metropolitan Area. With a p-value of 0.506 (the exact value varies slightly between permutation runs), greater than the significance level of 0.05, we failed to reject the null hypothesis. The bootstrap confidence interval also included zero, indicating no significant difference in prices between the two groups. These results suggest that renovations do not significantly impact home prices in the area.
There are some factors that may explain this test result:
The dataset includes houses built between 1900 and 2014, so the results may not reflect current market trends. Additionally, construction and renovation costs have changed significantly over time. Major economic events, such as the 2008 financial crisis, may have also influenced and skewed the results.
Since there is no significant difference between renovated and non-renovated houses, buyers can consider both options and focus on other important features such as square footage, waterfront location, number of floors, and overall potential. It’s advisable to prioritize location and long-term value rather than solely focusing on the renovation status.
Salespeople should emphasize the benefits of both renovated and non-renovated homes, helping buyers evaluate each option based on their needs. It’s important to educate buyers about the costs of renovations and the potential for customization in non-renovated homes, while also highlighting the convenience and modern features of renovated homes. Additionally, sales strategies should focus on other key property attributes such as location, square footage, and long-term potential. By doing this, salespeople can build trust in customers.
Hypothesis 1: Our analysis provides strong statistical evidence that the median house size in cities within the Seattle Metropolitan Area is smaller than 2,185 square feet. This is supported by a p-value of 0, leading to rejection of the null hypothesis, and a 95% confidence interval of 1,950 to 2,010 square feet, both below 2,185 square feet.
Hypothesis 2: The p-value of 0.122 (higher than \(\alpha = 0.05\)) and the hypothesized proportion (\(p_0 = 0.006\)) lying within the 95% bootstrap confidence interval \([0.0051, 0.0101]\) suggest that there is no significant difference between the observed proportion of waterfront homes and the hypothesized proportion of 0.6%. We fail to reject the null hypothesis.
Hypothesis 3: The Chi-square test of independence yielded a statistic of 1006.8248 and a very small p-value (essentially 0), which is much lower than the significance level of 0.05. As a result, we reject the null hypothesis and conclude that the house condition (ratings from 1 to 5) is dependent on the house’s age group.
Hypothesis 4: The p-value greater than the significance level of 0.05, and the bootstrap confidence interval including 0, suggest that there is no significant difference in the average prices between renovated and non-renovated homes. We fail to reject the null hypothesis, indicating that renovations do not significantly affect home prices in the Seattle Metropolitan Area.
Consider both the chronological and effective age of a property, as effective age reflects the true remaining life and condition of the home.
Be realistic about the budget, especially when considering waterfront properties, due to their high costs and limited availability.
Consider both renovated and non-renovated houses, because they are not significantly different in price. Furthermore, buyers should focus on key features like square footage, location, number of floors, and long-term potential, rather than just the renovation status.
Highlight the premium and scarcity of waterfront properties to attract buyers.
Adjust pricing strategies according to market trends to maximize profit.
Ensure compliance with legal and environmental regulations to avoid future issues.
Educate clients about the limited availability and higher prices of waterfront properties, as well as the legal and environmental risks but also leverage the scarcity of waterfront homes as a premium feature to attract investment buyers.
Diversify the portfolio, focusing more on homes without waterfront, as the majority of transactions involve such properties.
Emphasize the benefits of both renovated and non-renovated homes based on buyer preferences to build trust and meet customer needs.
CoreLogic. (n.d.). Effective age versus actual age. Retrieved December 8, 2024, from https://www.corelogic.com/intelligence/effective-age-versus-actual-age/
FasterCapital. (n.d.). Factors affecting property value in real estate. Retrieved December 8, 2024, from https://fastercapital.com/topics/factors-affecting-property-value-in-real-estate.html
Forbes Home. (n.d.). House styles through the decades. Forbes. Retrieved December 8, 2024, from https://www.forbes.com/home-improvement/design/house-styles-through-decades/
Fox, J. (2024, July 25). Why real estate prices are so high in Seattle: A 2024 insight. The Madrona Group. https://www.themadronagroup.com/why-real-estate-prices-are-so-high-seattle/
Fuerst, F., McAllister, P., Nanda, A., & Wyatt, P. (2013, June 17). An investigation of the effect of EPC ratings on house prices. Department of Energy and Climate Change.
Gatea, M. (2021, June 23). Seattle metro lot size decreasing by over 30% over the past two decades. https://www.storagecafe.com/blog/seattle-lot-sizes-drop-to-20-year-lows/
NeoMam Studios. (2022, December 2). The median home size in every U.S. state in 2022. Visual Capitalist. https://www.visualcapitalist.com/cp/median-home-size-every-american-state-2022/
Orton, K. (2015, February 20). The d.c.-area housing market, decoded: A 2014 statistical breakdown by ZIP code - The Washington Post. The Washington Post. https://www.washingtonpost.com/realestate/the-dc-area-housing-market-decoded-a-2014-statistical-breakdown-by-zip/2015/02/20/dbf57e76-b7af-11e4-a200-c008a01a6692_story.html
Rolling Out. (2024, August 24). The impact of gentrification on homeownership. Retrieved December 10, 2024, from https://rollingout.com/2024/08/24/impact-of-gentrification-homeownership/
Shalom, S. (2018, August 20). What size home should I buy? - coldwell banker blue matter blog. Coldwell Banker Blue Matter. https://blog.coldwellbanker.com/size-matters-finding-perfect-size-home/
Oliver L.E. Soden Agency. (2024, March 6). The hidden risks and liabilities of owning waterfront property. https://sodeninsurance.com/the-hidden-risks-and-liabilities-of-owning-waterfront-property/
Waldek, S. (2023, January 5). How much does it cost to renovate a house? Architectural Digest. https://www.architecturaldigest.com/story/cost-to-renovate-a-house
Normandy Homes. (2022, May 5). Is bigger better? Pros & cons of larger homes. https://normandyhomes.com/lifestyle/is-bigger-better-pros-and-cons-of-buying-a-larger-home/
Today’s Homeowner. (n.d.). What is the median home age in the U.S.? Retrieved December 8, 2024, from https://todayshomeowner.com/home-finances/guides/median-home-age-us/
Xu, L. (2022). The impact of structural and energy-efficient renovations on property values (Doctoral dissertation). University of Padova. https://www.universityofpadova.com/thesis/impact_of_renovations